# Multimodal Transformer

**Jedi 7B 1080p GGUF** — lmstudio-community · Apache-2.0 · 113 downloads · 1 like
Tags: Image-Text-to-Text, English
An image-text-to-text generation model based on the Transformer architecture, designed specifically for computer and GUI scenarios, with intelligent-agent capabilities.

**My Model** — anoushhka · MIT · 87 downloads · 0 likes
Tags: Image-to-Text, PyTorch, Supports Multiple Languages
GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.

**Spaceexploreai Small Base Regression 27M** — NEOAI · Apache-2.0 · 57 downloads · 4 likes
Tags: Large Language Model, Supports Multiple Languages
A deep-learning-based investment prediction system using the Transformer architecture, integrating design elements from DeepSeek-V3 and Llama 3 for stock-price trend forecasting and technical analysis.

**Microsoft Git Base** — seckmaster · MIT · 18 downloads · 0 likes
Tags: Image-to-Text, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model capable of converting visual content into textual descriptions.

**Stable Diffusion 3.5 Large Turbo** — stabilityai · Other · 57.11k downloads · 581 likes
Tags: Text-to-Image, English
A text-to-image model based on the Multimodal Diffusion Transformer (MMDiT), using Adversarial Diffusion Distillation (ADD) to improve image quality, typography, and complex prompt understanding.

**Git Large Coco** — alexgk · MIT · 25 downloads · 0 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.

**Git Base Finetune** — wangjin2000 · MIT · 18 downloads · 0 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model capable of converting visual content into descriptive text.

**Textcaps Teste2** — artificialguybr · MIT · 26 downloads · 3 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based image-to-text generation model trained on large-scale image-text pairs, capable of tasks such as image captioning and visual question answering.

**Git Large R Textcaps** — microsoft · MIT · 51 downloads · 10 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder dual-conditioned on CLIP image tokens and text tokens, designed for tasks such as image caption generation and visual question answering.

**Git Large R Coco** — microsoft · MIT · 86 downloads · 10 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model capable of generating descriptive text from images.

**Git Large Vatex** — microsoft · MIT · 267 downloads · 1 like
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder conditioned on CLIP image tokens and text tokens, designed for tasks like image and video caption generation and visual question answering.

**Git Large Textvqa** — microsoft · MIT · 62 downloads · 4 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a vision-language model based on a Transformer decoder, trained with dual conditioning on CLIP image tokens and text tokens, optimized specifically for TextVQA tasks.

**Git Large Vqav2** — microsoft · MIT · 401 downloads · 17 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder based on CLIP image tokens and text tokens, trained on large-scale image-text pairs, suited to tasks like visual question answering.

**Git Large Textcaps** — microsoft · MIT · 1,749 downloads · 28 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a dual-conditioned Transformer decoder, designed for tasks such as image caption generation and visual question answering.

**Git Large Coco** — microsoft · MIT · 6,582 downloads · 103 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder-based vision-language model capable of generating image captions and performing visual question answering.

**Git Base Vatex** — microsoft · MIT · 752 downloads · 4 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model; this base version is fine-tuned on the VATEX dataset and suited to tasks such as image and video caption generation.

**Git Large** — microsoft · MIT · 1,404 downloads · 15 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder dual-conditioned on CLIP image tokens and text tokens for image-to-text generation tasks.

**Git Base Vqav2** — microsoft · MIT · 199 downloads · 19 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder-based vision-language model trained with conditioning on CLIP image tokens and text tokens, suited to tasks like image captioning and visual question answering.

**Git Base Textcaps** — microsoft · MIT · 482 downloads · 8 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model capable of converting visual content into descriptive text.

**Git Base Coco** — microsoft · MIT · 5,461 downloads · 19 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder conditioned on CLIP image tokens and text tokens, used for tasks such as image caption generation and visual question answering.

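The GIT entries above all describe the same conditioning scheme: a single Transformer decoder sees CLIP image tokens followed by text tokens, where image tokens attend to each other bidirectionally and text tokens attend to all image tokens plus earlier text tokens causally. As an illustrative sketch (a toy numpy mask, not Microsoft's implementation), that attention pattern looks like this:

```python
import numpy as np

def git_style_attention_mask(num_image_tokens: int, num_text_tokens: int) -> np.ndarray:
    """Build a GIT-style attention mask (1 = may attend, 0 = masked).

    Image tokens attend bidirectionally among themselves; text tokens
    attend to every image token and causally to preceding text tokens.
    """
    n = num_image_tokens + num_text_tokens
    mask = np.zeros((n, n), dtype=int)
    # Image tokens: full bidirectional attention among themselves.
    mask[:num_image_tokens, :num_image_tokens] = 1
    # Text tokens: attend to all image tokens...
    mask[num_image_tokens:, :num_image_tokens] = 1
    # ...and causally to text tokens up to and including themselves.
    causal = np.tril(np.ones((num_text_tokens, num_text_tokens), dtype=int))
    mask[num_image_tokens:, num_image_tokens:] = causal
    return mask

print(git_style_attention_mask(2, 3))
```

The key property is the asymmetry: text rows have access to the whole image block, while image rows never see text, which is what lets one decoder serve both captioning and VQA-style generation.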
**Vision Perceiver Conv** — deepmind · Apache-2.0 · 7,127 downloads · 6 likes
Tags: Image Classification, Transformers
A general-purpose vision Perceiver model pre-trained on ImageNet, combining convolutional preprocessing with a Transformer architecture and supporting image classification tasks.

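The Perceiver's central idea is that a small learned latent array cross-attends to a much larger input array, so attention cost scales linearly in the input size rather than quadratically. A minimal numpy sketch of one such cross-attention step (a toy illustration, not DeepMind's implementation, which additionally uses the convolutional preprocessor mentioned above):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents: np.ndarray, inputs: np.ndarray) -> np.ndarray:
    """One Perceiver-style cross-attention step.

    A small latent array (num_latents x d) queries a large input array
    (num_inputs x d): cost is O(num_latents * num_inputs) instead of the
    O(num_inputs^2) of full self-attention over the inputs.
    """
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)   # (num_latents, num_inputs)
    return softmax(scores, axis=-1) @ inputs   # (num_latents, d)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))     # small learned bottleneck
pixels = rng.normal(size=(1024, 16))   # large array of image features
out = cross_attend(latents, pixels)
print(out.shape)  # (8, 16)
```

Because the output keeps the latent shape regardless of input length, the same body of Transformer layers can then run self-attention over just the 8 latents, which is what makes the architecture "general-purpose" across modalities.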
**S2T Small MuST-C En-Es ST** — facebook · MIT · 20 downloads · 0 likes
Tags: Speech Recognition, Transformers, Supports Multiple Languages
A speech-to-text Transformer model for end-to-end English-to-Spanish speech translation.

© 2025 AIbase